Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Don't attempt package installation while another installation might still be running #84190

Merged
merged 1 commit into from
Dec 1, 2020

Conversation

skh
Copy link
Contributor

@skh skh commented Nov 24, 2020

Summary

Implements #75810

This changes _install_package() to return early when it detects that the package might be in the process of being installed by another process.

It does so by checking the install_status and install_started_at fields on the epm-packages saved object. If the previous installation was started longer than MAX_TIME_COMPLETE_INSTALL (currently, 60 seconds) ago, the installation will be attempted again.

MAX_TIME_COMPLETE_INSTALL is also used by ensurePackagesCompletedInstall() to decide whether to reattempt package installations. This should not break, or be broken by, the change made in this PR.

How to test this

I've created https://gist.github.com/skh/cc695952031c9e349874b898c7066e42 to test this locally. On my system, with a local ES instance

  • when the two installation attempts are about 2-5 seconds apart, the behavior is as expected: the second installation attempt returns early and lists the assets that have been installed by the first attempt up until then. That means the second attempt lists less assets in its response than the first.
  • when the two installation attempts are more than 10 seconds apart, the first one completes before the second is started, so two complete installations are performed. Both attempts list the same number of assets in their responses.
  • when the two installation attempts are really close together, we still run into a race condition with the saved object itself. I can trigger this by sending the two install attempts within about 500 milliseconds of each other:
Response for second install attempt: {"statusCode":404,"error":"Not Found","message":"Saved object [epm-packages/apache] not found"}
Response for first install attempt: {"statusCode":409,"error":"Conflict","message":"[epm-packages:apache]: version conflict, required seqNo [1879], primary term [1]. current document has seqNo [1880] and primary term [1]: version_conflict_engine_exception"}
Starting package delete
Response for delete attempt: {"statusCode":400,"error":"Bad Request","message":"apache is not installed"}

Fixing this is beyond the scope of this PR.

UPDATE: after discussing with the Kibana Core team, the 409 error above happens when the saved object version has changed between a read and a subsequent update operation. We can use SavedObjectsClient.errors.isConflictError() to catch this specific case, and repeat the check for installed_status and install_started_at.

Follow-up issues: #84651 and #84656

@skh skh self-assigned this Nov 24, 2020
@skh skh added Feature:Fleet Fleet team's agent central management project release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.11.0 v8.0.0 labels Nov 24, 2020
@skh skh changed the title Return early when parallel install process detected [Fleet] Don't attempt package installation while another installation might still be running Nov 24, 2020
@skh skh force-pushed the 75810-dont-attempt-concurrent-install branch from 3ebb4bb to e0aad6c Compare November 30, 2020 10:23
@skh skh marked this pull request as ready for review November 30, 2020 10:56
@skh skh requested a review from a team November 30, 2020 10:56
@elasticmachine
Copy link
Contributor

Pinging @elastic/ingest-management (Feature:Fleet)

@skh
Copy link
Contributor Author

skh commented Nov 30, 2020

@neptunian and @jonathan-buttner I would be very interested to hear if you think this change is worth it at all.

@jonathan-buttner
Copy link
Contributor

Hmm, @skh do you think returning a partial list of the assets is helpful to the requester? I wonder if it'd be better to return some status code that indicates that an installation request is already running? Or that the service is busy and the requester should try again later. That way it could try, then wait for a bit and try again and receive the complete list of assets.

@skh
Copy link
Contributor Author

skh commented Nov 30, 2020

do you think returning a partial list of the assets is helpful to the requester?

Good point. I was thinking of how the UI currently uses the return value, and by returning just a list, albeit a shorter or empty one, the UI doesn't need additional changes for what should be a rare edge case.

@jonathan-buttner
Copy link
Contributor

do you think returning a partial list of the assets is helpful to the requester?

Good point. I was thinking of how the UI currently uses the return value, and by returning just a list, albeit a shorter or empty one, the UI doesn't need additional changes for what should be a rare edge case.

Gotcha, the Security Solution app calls the bulk upgrade endpoint which uses this code path I believe. We don't check the return value so we're fine there. I still think it might be worth refactoring this in the future because other apps might call the apis frequently which could cause the issue to happen more often.

@skh
Copy link
Contributor Author

skh commented Dec 1, 2020

@jonathan-buttner Thanks! I've opened #84656 and #84651 as follow-up issues.

@skh skh added the Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project label Dec 1, 2020
@skh skh deleted the 75810-dont-attempt-concurrent-install branch December 1, 2020 15:51
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

gmmorris added a commit to gmmorris/kibana that referenced this pull request Dec 1, 2020
* master: (63 commits)
  Revert the Revert of "[Alerting] renames Resolved action group to Recovered (elastic#84123)"  (elastic#84662)
  declare kbn/monaco dependency on kbn/i18n explicitly (elastic#84660)
  Remove unscripted fields from sample data index-pattern saved objects (elastic#84659)
  [ML] Fix unnecessary trigger of wildcard field type search for ML plugin routes. (elastic#84605)
  Update create.asciidoc (elastic#84046)
  [Security Solution][Detections] Fix labels and issue with mandatory fields (elastic#84525)
  Fix flaky test suite (elastic#84602)
  [Security Solution] [Detections] Create a 'partial failure' status for rules (elastic#84293)
  Revert "[Alerting] renames Resolved action group to Recovered (elastic#84123)"
  Update code-comments describing babel plugins (elastic#84622)
  [Security Solution] [Cases] Cypress for case connector selector options (elastic#80745)
  [Discover] Unskip doc table tests (elastic#84564)
  [Lens] (Accessibility) Improve landmarks in Lens (elastic#84511)
  [Lens] (Accessibility) Focus mistakenly stops on righthand form (elastic#84519)
  Return early when parallel install process detected (elastic#84190)
  [Security Solution][Detections] Support arrays in event fields for Severity/Risk overrides (elastic#83723)
  [Security Solution][Detections] Fix grammatical error in validation message for threshold field in "Create new rule" -> "Define rule" (elastic#84490)
  [Fleet] Update agent details page  (elastic#84434)
  adding documentation of use of NODE_EXTRA_CA_CERTS env var (elastic#84578)
  [Search] Integrate "Send to background" UI with session service (elastic#83073)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Feature:Fleet Fleet team's agent central management project release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.11.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants